19. Central Limit Theorem

Number of Outcomes

We have already seen coin flips, and how they relate to Pascal's triangle, how they evolve in to binomial distribution earlier in Problem Set 2 - Probability section.

In 19A. Central Limit Theorem section, Sebastian ups the game with tossing a die with 3 outcomes (1,2,3). While we could follow pattern to calculate results, visualizing how that outcomes would be could be helpful to strengthen our insights. So I have created another set of functions that could help us do that.

In [1]:
# just a helper function for easier youtube call
def strip_url(url):
    return url.replace('https://youtu.be/','')

from IPython.display import YouTubeVideo
url = 'https://youtu.be/jajxhNBnbmI'
YouTubeVideo(strip_url(url))
Out[1]:

What is the largest sum of outcomes possible?

In given problem (die toss resulting in 0 or 1 or 2), the maximum possible outcome is 2.
If we toss 4 times, each time maximum possibility is 2, so, the largest sum we could get out of all sequences is $4 \times 2 = 8$

Counting Outcomes 5

In [2]:
url = 'https://youtu.be/4d6upzsvBTk'
YouTubeVideo(strip_url(url))
Out[2]:
In [4]:
from graphviz import Digraph
from anyflipviz import draw_graph

# define our problem
n_outcomes = 3  # 0,1,2
n_flips = 1     # as per current problem in above video

g = Digraph(strict=True)
g = draw_graph(g, n_outcomes, n_flips)
g

Out[4]:
%3 Root R 1 0 Root->1 2 1 Root->2 3 2 Root->3

As can be seen above, for just 1 flip, sum is just the outcome as there is nothing more to add. So X=0 sum 0 occurs 1 time, sum 1 occurs 1 time, and sum 2 occurs 1 time.

$ \text {Let X denote the sum of sequence. Then after 1 toss} \\ X = 0, \ n(X) = 1 \\ X = 1, \ n(X) = 1 \\ X = 2, \ n(X) = 1 $

Counting Outcomes 6

The tricky part starts here..

In [5]:
url = 'https://youtu.be/QUw97os4gAc'
YouTubeVideo(strip_url(url))
Out[5]:

Instead of following pattern established earlier and used by Sebastian above, let us visualize the problem in its entirety.

In [6]:
# define our problem
n_outcomes = 3  # 0,1,2
n_flips = 2     # as per current problem in above video

g = Digraph(strict=True)
g = draw_graph(g, n_outcomes, n_flips)
g

Out[6]:
%3 Root R 1 0 Root->1 2 1 Root->2 3 2 Root->3 4 0 1->4 5 1 1->5 6 2 1->6 7 0 2->7 8 1 2->8 9 2 2->9 10 0 3->10 11 1 3->11 12 2 3->12

Now each seqeunce will have 2 digits. For eg, take left most branch in above tree. Its 00. So sum is 0. And then we have 01, Sum = 1 and so on. Let us also tabulate these sequences for your better understanding. Along the way, I also note the sequence's sum X.

In [8]:
import pandas as pd 
from anyflipviz import get_combinations

temp_df = get_combinations(n_outcomes, n_flips)
temp_df
Out[8]:
sequence x
0 00 0
1 01 1
2 02 2
3 10 1
4 11 2
5 12 3
6 20 2
7 21 3
8 22 4

Let us now count, the no of occurances of the sums., that is n(X). Assuming equal probability for all 3 outcomes, we could also calculate p(X) alongside n(X). You could count n(X) from above table and cross verify with below table.

In [10]:
from anyflipviz import get_combinations_consolidated

final_df = get_combinations_consolidated(n_outcomes, n_flips)

# JUST TO HIGHLIGHT A COLUMN
# thank you: https://stackoverflow.com/questions/41654949/pandas-style-function-to-hignlight-specific-columns
coldict = {'n(x)':'yellow'}
def highlight_cols(s, coldict):
    if s.name in coldict.keys():
        return ['background-color: {}'.format(coldict[s.name])] * len(s)
    return [''] * len(s)

final_df.style.apply(highlight_cols, coldict=coldict)
Out[10]:
x n(x) p(x)
0 0 1 0.111111
1 1 2 0.222222
2 2 3 0.333333
3 3 2 0.222222
4 4 1 0.111111

Note that, the highlighted numbers exactly match what Sebastian arrives at. While we recognize and use patterns, it is important to also be able to connect back to this foundational emergence of pattern. While visualizing tree, can help you appreciate the complexity of the problem we have simplified in to patterns, the derivation helps to keep connected to the fundamentals.

Counting Outcomes 7

In [11]:
url = 'https://youtu.be/b5gHsdQFZ3k'
YouTubeVideo(strip_url(url))
Out[11]:
In [12]:
# define our problem
n_outcomes = 3  # 0,1,2
n_flips = 3     # as per current problem in above video

g = Digraph(strict=True)
g = draw_graph(g, n_outcomes, n_flips)
g


Out[12]:
%3 Root R 1 0 Root->1 2 1 Root->2 3 2 Root->3 4 0 1->4 5 1 1->5 6 2 1->6 7 0 2->7 8 1 2->8 9 2 2->9 10 0 3->10 11 1 3->11 12 2 3->12 13 0 4->13 14 1 4->14 15 2 4->15 16 0 5->16 17 1 5->17 18 2 5->18 19 0 6->19 20 1 6->20 21 2 6->21 22 0 7->22 23 1 7->23 24 2 7->24 25 0 8->25 26 1 8->26 27 2 8->27 28 0 9->28 29 1 9->29 30 2 9->30 31 0 10->31 32 1 10->32 33 2 10->33 34 0 11->34 35 1 11->35 36 2 11->36 37 0 12->37 38 1 12->38 39 2 12->39

Whew, its big. Over to all those possible sequences..

In [13]:
temp_df = get_combinations(n_outcomes, n_flips)
temp_df
Out[13]:
sequence x
0 000 0
1 001 1
2 002 2
3 010 1
4 011 2
5 012 3
6 020 2
7 021 3
8 022 4
9 100 1
10 101 2
11 102 3
12 110 2
13 111 3
14 112 4
15 120 3
16 121 4
17 122 5
18 200 2
19 201 3
20 202 4
21 210 3
22 211 4
23 212 5
24 220 4
25 221 5
26 222 6

For just 3 flips, the problem has outgrown in to a complex one, with so many to count..Thanks to our program, now we could easily calculate the no of occurances of all possible sums.

In [14]:
from anyflipviz import get_combinations_consolidated

final_df = get_combinations_consolidated(n_outcomes, n_flips)

final_df.style.apply(highlight_cols, coldict=coldict)
Out[14]:
x n(x) p(x)
0 0 1 0.037037
1 1 3 0.111111
2 2 6 0.222222
3 3 7 0.259259
4 4 6 0.222222
5 5 3 0.111111
6 6 1 0.037037

Whoa! We got it right! :)

Take a moment to compare and wonder how this complicated evolution becomes an easy pattern counting Sebastian deploys, and this could easily be applied to any number of flips without we going through above complicated trees.

Counting Outcomes 8

Here we go, for even bigger viz..

In [15]:
url = 'https://youtu.be/zVZJkMwBOwA'
YouTubeVideo(strip_url(url))
Out[15]:
In [16]:
# define our problem
n_outcomes = 3  # 0,1,2
n_flips = 4     # as per current problem in above video

g = Digraph(strict=True)
g = draw_graph(g, n_outcomes, n_flips)
g



Out[16]:
%3 Root R 1 0 Root->1 2 1 Root->2 3 2 Root->3 4 0 1->4 5 1 1->5 6 2 1->6 7 0 2->7 8 1 2->8 9 2 2->9 10 0 3->10 11 1 3->11 12 2 3->12 13 0 4->13 14 1 4->14 15 2 4->15 16 0 5->16 17 1 5->17 18 2 5->18 19 0 6->19 20 1 6->20 21 2 6->21 22 0 7->22 23 1 7->23 24 2 7->24 25 0 8->25 26 1 8->26 27 2 8->27 28 0 9->28 29 1 9->29 30 2 9->30 31 0 10->31 32 1 10->32 33 2 10->33 34 0 11->34 35 1 11->35 36 2 11->36 37 0 12->37 38 1 12->38 39 2 12->39 40 0 13->40 41 1 13->41 42 2 13->42 43 0 14->43 44 1 14->44 45 2 14->45 46 0 15->46 47 1 15->47 48 2 15->48 49 0 16->49 50 1 16->50 51 2 16->51 52 0 17->52 53 1 17->53 54 2 17->54 55 0 18->55 56 1 18->56 57 2 18->57 58 0 19->58 59 1 19->59 60 2 19->60 61 0 20->61 62 1 20->62 63 2 20->63 64 0 21->64 65 1 21->65 66 2 21->66 67 0 22->67 68 1 22->68 69 2 22->69 70 0 23->70 71 1 23->71 72 2 23->72 73 0 24->73 74 1 24->74 75 2 24->75 76 0 25->76 77 1 25->77 78 2 25->78 79 0 26->79 80 1 26->80 81 2 26->81 82 0 27->82 83 1 27->83 84 2 27->84 85 0 28->85 86 1 28->86 87 2 28->87 88 0 29->88 89 1 29->89 90 2 29->90 91 0 30->91 92 1 30->92 93 2 30->93 94 0 31->94 95 1 31->95 96 2 31->96 97 0 32->97 98 1 32->98 99 2 32->99 100 0 33->100 101 1 33->101 102 2 33->102 103 0 34->103 104 1 34->104 105 2 34->105 106 0 35->106 107 1 35->107 108 2 35->108 109 0 36->109 110 1 36->110 111 2 36->111 112 0 37->112 113 1 37->113 114 2 37->114 115 0 38->115 116 1 38->116 117 2 38->117 118 0 39->118 119 1 39->119 120 2 39->120
In [17]:
temp_df = get_combinations(n_outcomes, n_flips)
temp_df
Out[17]:
sequence x
0 0000 0
1 0001 1
2 0002 2
3 0010 1
4 0011 2
5 0012 3
6 0020 2
7 0021 3
8 0022 4
9 0100 1
10 0101 2
11 0102 3
12 0110 2
13 0111 3
14 0112 4
15 0120 3
16 0121 4
17 0122 5
18 0200 2
19 0201 3
20 0202 4
21 0210 3
22 0211 4
23 0212 5
24 0220 4
25 0221 5
26 0222 6
27 1000 1
28 1001 2
29 1002 3
... ... ...
51 1220 5
52 1221 6
53 1222 7
54 2000 2
55 2001 3
56 2002 4
57 2010 3
58 2011 4
59 2012 5
60 2020 4
61 2021 5
62 2022 6
63 2100 3
64 2101 4
65 2102 5
66 2110 4
67 2111 5
68 2112 6
69 2120 5
70 2121 6
71 2122 7
72 2200 4
73 2201 5
74 2202 6
75 2210 5
76 2211 6
77 2212 7
78 2220 6
79 2221 7
80 2222 8

81 rows 2 columns

In [18]:
final_df = get_combinations_consolidated(n_outcomes, n_flips)

final_df.style.apply(highlight_cols, coldict=coldict)
Out[18]:
x n(x) p(x)
0 0 1 0.0123457
1 1 4 0.0493827
2 2 10 0.123457
3 3 16 0.197531
4 4 19 0.234568
5 5 16 0.197531
6 6 10 0.123457
7 7 4 0.0493827
8 8 1 0.0123457

Central Limit Theorem

In [19]:
url = 'https://youtu.be/RAGZjOOc7o8'
YouTubeVideo(strip_url(url))
Out[19]:

There is no question here, I just added, for proper conclusion of the pattern we observed so far. Let us draw as well for n_flips = 4, just to see, how the graph looks. What is to be fathomed is, why the sequences have this property? It has nothing to do with probability (which again as a density function that would happen just as a linear function of this distribution as shown on RHS ) so may be we should dig further, why this phenomenon occurs. When you encounter probability questions, and one says assume guassian distribution, the key to why so, lies in understanding this distribution we observe.

In [22]:
from anyflipviz import plot_combinations_consolidated

plot_combinations_consolidated(final_df, fontsize=12)

LHS is the graph, Sebastian talks about in above video.